Abstract

This paper studies deep network architectures to address 
the problem of video classification. A multi-stream frame¬ 
work is proposed to fully utilize the rich multimodal infor¬ 
mation in videos. Specifically, we first train three Convo¬ 
lutional Neural Networks to model spatial, short-term mo¬ 
tion and audio clues respectively. Long Short Term Mem¬ 
ory networks are then adopted to explore long-term tempo¬ 
ral dynamics. With the outputs of the individual streams, 
we propose a simple and effective fusion method to gener¬ 
ate the final predictions, where the optimal fusion weights 
are learned adaptively for each class, and the learning pro¬ 
cess is regularized by automatically estimated class rela¬ 
tionships. Our contributions are two-fold. First, the pro¬ 
posed multi-stream framework is able to exploit multimodal 
features that are more comprehensive than those previously 
attempted. Second, we demonstrate that the adaptive fusion 
method using the class relationship as a regularizer out¬ 
performs traditional alternatives that estimate the weights 
in a “free” fashion. Our framework produces significantly 
better results than the state of the art on two popular
benchmarks, 92.2% on UCF-101 (without using audio) and 
84.9% on Columbia Consumer Videos. 

1. Introduction 

The problem of video classification based on semantic 
contents like human actions or complex events has been ex¬ 
tensively studied in the computer vision community. The 
fact that videos are intrinsically multimodal demands so¬ 
lutions that can explore not only static visual information, 
but also motion and auditory clues. Key to the develop¬ 
ment of video classification systems is the design of good 
features. Popular feature descriptors include the SIFT [28], 
the Mel-Frequency Cepstral Coefficients (MFCC) [47], the 
STIP [26] and the dense trajectories [44], which can be 
encoded into video-level representations by bag-of-words 
(BoW) encoding. Recently, deep
neural networks that can learn features automatically from 
raw data have demonstrated strong performance in various 



Figure 1. Illustration of the proposed multi-stream video classifi¬ 
cation framework. 


domains. In particular, the convolutional neural networks 
(ConvNets) are very successful on image analysis tasks like 
object detection [13], object recognition [37, 41] and image
segmentation [11]. However, for video classification, most 
deep network based approaches (e.g., [18, 21, 36]) demon¬ 
strated worse or similar results to the hand-engineered fea¬ 
tures [44]. This is largely due to the high complexity of 
the video data. Unlike images that only have static visual 
appearance information, videos also contain temporal mo¬ 
tions and auditory soundtracks. For example, a “diving” 
action video usually involves a sequence of atoms, such as 
“jumping from a platform”, “rotating in the air” and “falling 
into water”, accompanied by cheering or clapping sounds. 
Some approaches [18, 21, 36] only focused on the static 
frames and short-term motion clues captured by a few ad¬ 
jacent frames, which are apparently not sufficient. A few 
very recent studies attempted to use recurrent neural net¬ 
works (RNN) to model long-term temporal information and 
achieved competitive performance [32, 46]. Nevertheless, 
the audio information has rarely been exploited. In addition,
most existing approaches fused the outputs of multiple
networks in a very straightforward way [36], which could 
lead to sub-optimal performance. 

Realizing the above limitations, in this paper, we propose 
a multi-stream framework of deep neural networks to ex¬ 
ploit the multimodal clues for video classification. Figure 1 
illustrates the diagram of our approach. Three ConvNets 
are trained to model the static spatial information, short¬ 
term motion and auditory clues, respectively. The motion 
stream is computed on stacked optical flows over a short
temporal window and thus can only capture short-term motion.
In order to model the long-term temporal clues, we
employ a Recurrent Neural Network (RNN) model, namely
the Long Short Term Memory (LSTM), on the frame-level 
spatial and motion features extracted by the ConvNets. The 
LSTM encodes history information in memory units regu¬ 
lated with non-linear gates to discover temporal dependen¬ 
cies. To combine the outputs from different networks, we 
develop a simple yet effective fusion method to learn the op¬ 
timal fusion weights adaptively for each class. We propose 
to regularize the weight learning process using class rela¬ 
tionships estimated without using additional labels. This 
helps inject class context into the final predictions and thus 
can significantly improve the results. Our contributions are 
summarized as follows: 

1. We introduce a multi-stream framework that integrates 
spatial, short-term motion, long-term temporal and
auditory clues in videos. We demonstrate that the multi-stream
networks are able to digest complementary information
to achieve significantly improved performance.

2. We propose a simple and effective fusion method to 
combine the outputs of the individual networks. The 
method learns fusion weights adaptively for each class 
and is able to harness class relationships in the weight 
learning process. We empirically show that the class 
relationship regularizer is very effective. 

Incorporating the fusion method into the proposed multi¬ 
stream framework, we achieve superior performance on two 
popular benchmark datasets. 

2. Related Works 

As mentioned above, video classification has been extensively
studied, and significant effort has been devoted to designing
hand-engineered features or classifiers. We focus the
review on recent works related to our proposed approach.

Motivated by the promising results of deep networks
(particularly the ConvNets) on image analysis tasks [41,
37, 13], several works have exploited deep architectures
for video classification. Ji et al. extended CNN models
into spatial-temporal space by operating on stacked video 


frames [18]. Karpathy et al. compared several architectures
for action recognition [21]. Tran et al. proposed to
learn generic spatial-temporal features which can be com¬ 
puted efficiently [42]. Simonyan and Zisserman [36] intro¬ 
duced an interesting two-stream approach, where two Con¬ 
vNets are trained to explicitly capture spatial and short-term 
motion information using frames and stacked optical flows 
as inputs, respectively. Final predictions can be obtained 
by linearly averaging the prediction scores of the two ConvNets.
In this paper, we also adopt two ConvNets similar to those
in [36]. However, as the two-stream approach is not able
to model the auditory and the long-term temporal clues, we 
adopt additional networks to build a more comprehensive 
framework. A novel fusion method is also proposed to com¬ 
bine the multi-stream outputs, which is better than the sim¬ 
ple linear fusion used in [36]. 

The RNN has been shown to be effective on many se¬ 
quential modeling tasks, such as speech recognition [14] 
and image/video description [9, 48]. For long-term tem¬ 
poral modeling of the video data, Srivastava et al. proposed 
an LSTM encoder-decoder framework to learn video representations
in an unsupervised manner [39]. Donahue et al.
[9] and Wu et al. [46] trained a two-layer LSTM network
for action classification. Ng et al. [32] further demonstrated
that a five-layer LSTM network is slightly better.

Fusion is needed to combine the outputs of separate pre¬ 
diction models. The simplest solution is linear weighted 
fusion, which has been adopted in many recent approaches 
like [36]. Nandakumar et al. performed score fusion using 
a method called likelihood ratio test [30]. More recently, 
Xu et al. [47] and Ye et al. [49] proposed robust late fusion 
methods by seeking a low rank matrix to remove the noise 
of individually trained classifiers. Liu et al. [27] proposed
to predict sample-specific weights in the fusion process.

There are many studies using context or class relation¬ 
ships to improve visual recognition performance. For in¬ 
stance, Rabinovich et al. utilized a Conditional Random 
Field (CRF) model to maximize object label agreement 
based on contextual relevance [34]. Deng et al. proposed to 
jointly train a hierarchy and exclusion graph model with a 
ConvNet to learn class relations for image classification [8].
Assari et al. exploited class co-occurrences for video classification [2].
Different from these works, we use class relationships
as a regularizer to learn fusion weights adaptively
for each class. 


3. Methodology 

In this section, we first describe the individual streams
and then introduce the proposed adaptive multi-stream fu¬ 
sion method, followed by implementation details. 


3.1. Multi-Stream ConvNets 


Carrying abundant multimodal information, videos nor¬ 
mally show the movements and interactions of objects un¬ 
der certain scenes over time, accompanied by human voices 
or background sounds. Therefore, video data can be nat¬ 
urally decomposed into spatial, motion and audio streams. 
The spatial stream consisting of individual frames depicts 
the static appearance information, while the motion stream 
captures object or scene movements demonstrated by con¬ 
tinuous frames. In addition, sounds in the audio stream 
provide crucial clues that are often complementary to the 
visual counterpart. Motivated by the recent two-stream ap¬ 
proach [36], we train three ConvNets to exploit the multi¬ 
modal information, as described below. 

In brief, the spatial ConvNet uses the raw frames as in¬ 
puts, where we adopt a deep architecture with superior per¬ 
formance on image recognition tasks [37]. It can effec¬ 
tively recognize certain video semantics that have clear and 
discriminative appearance characteristics. For the motion 
stream, we train a ConvNet model operating on stacked op¬ 
tical flows following [36]. More specifically, through com¬ 
puting displacement vectors in both horizontal and vertical 
ways, the optical flows encode subtle motion patterns of ob¬ 
jects between each pair of adjacent frames, which can be 
converted into two flow images as the inputs of the motion 
stream ConvNet. Previous studies have shown that further 
improvements can be obtained by stacking consecutive optical
flow images in a short time window, owing to the inclusion
of relatively more compact movements [36]. In order
to leverage the audio information, we first apply the Short-Time
Fourier Transform to convert the 1-D soundtrack
into a 2-D image (namely a spectrogram), with the horizontal
and vertical axes being time and frequency,
respectively. Then we employ a ConvNet to operate on the
spectrograms as suggested in [43]. Notice that the ConvNet
is well suited for modeling audio signals based on spectrograms,
since its weight sharing and max pooling mechanisms
provide invariance to small frequency shifts [1].
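
As a rough illustration of the audio preprocessing described above, the following NumPy sketch converts a 1-D signal into a 2-D log-magnitude spectrogram with frequency on the vertical axis and time on the horizontal axis. The frame length, hop size, Hann window, and the synthetic 440 Hz tone are illustrative assumptions; the paper does not state its STFT settings.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude STFT of a 1-D signal; returns (freq_bins, time_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames: (n_frames, frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T  # rows: frequency, columns: time

# Synthetic 1 s tone at 440 Hz sampled at 8 kHz (hypothetical input)
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

The resulting array can be treated like a single-channel image and resized to the input resolution expected by the audio ConvNet.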

3.2. Long Term Temporal Modeling 

As the motion stream ConvNet only captures short-term 
motion patterns, we further employ LSTM [16] to model 
long-term temporal clues in the visual channel. LSTM is 
a popular RNN model that incorporates memory cells with 
several gates to learn long-term dependencies without suffering
from the vanishing and exploding gradients that affect traditional
RNNs [5]. It is able to exploit temporal information
of a data sequence with arbitrary length through recursively 
mapping the input sequence to output labels with hidden 
LSTM units. Each of the units maintains a built-in memory 
cell, which stores information over time guarded by several 
non-linear gate units to control the amount of changes and 
influence of the memory contents. 



Figure 2. The structure of an LSTM unit. 


Figure 2 illustrates the typical structure of a hidden
LSTM unit. In our framework, we denote $x_t$ as the feature
representation of a video frame or a stacked optical
flow image at the $t$-th time step. Generally, an LSTM
maps an input sequence $(x_1, x_2, \ldots, x_T)$ to output labels
$(y_1, y_2, \ldots, y_T)$ through computing activations of the units
in the network recursively from $t = 1$ to $t = T$. At time $t$,
the activation vectors of memory cell $c_t$, output gate $o_t$ and
hidden state $h_t$ are computed as:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),$$

$$h_t = o_t \odot \tanh(c_t), \tag{1}$$

where $W_{xc}$, $W_{hc}$, $W_{xo}$, $W_{ho}$, $W_{co}$ are the weight matrices
connecting two different units, $b_c$, $b_o$ are the bias terms,
$\sigma$ is the sigmoid function, and $\odot$ is an element-wise product
operator. Notice that $i_t$ and $f_t$ are the activation vectors
of input and forget gates, which are calculated with weight
matrices as:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f). \tag{2}$$


From the above equations, the contents of the memory
cell at the $t$-th time step, $c_t$, are computed as the weighted
sum of the current inputs and the previous memory contents
$c_{t-1}$. The input and forget gates (i.e., $i_t$ and $f_t$) impose
regularization to determine whether to consider new information
or forget old information. In addition, the output
gate $o_t$ controls the amount of information from the memory
contents that is passed to the hidden state $h_t$ to influence
the computation in the next time step.
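
The recurrence in Equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the parameter dictionary, the random initialization, the toy dimensions, and the use of full (rather than diagonal) peephole matrices for the cell-to-gate terms are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with cell-to-gate (peephole) terms, as in Equations 1 and 2."""
    # Input and forget gates look at the previous cell contents (Equation 2)
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])
    # Memory cell, output gate and hidden state (Equation 1)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(d_in, d_hid, rng):
    """Hypothetical small random initialization, for illustration only."""
    p = {}
    for gate in "ifco":
        p[f"W_x{gate}"] = 0.1 * rng.standard_normal((d_hid, d_in))
        p[f"W_h{gate}"] = 0.1 * rng.standard_normal((d_hid, d_hid))
        p[f"b_{gate}"] = np.zeros(d_hid)
    for gate in "ifo":  # cell-to-gate weights exist only for the three gates
        p[f"W_c{gate}"] = 0.1 * rng.standard_normal((d_hid, d_hid))
    return p

rng = np.random.default_rng(0)
d_in, d_hid, T = 8, 4, 5
params = init_params(d_in, d_hid, rng)
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.standard_normal((T, d_in)):  # stand-in frame features
    h, c = lstm_step(x_t, h, c, params)
```

In the framework described here, each `x_t` would be a ConvNet feature of a frame or a stacked optical flow image, and `h` after the loop summarizes the whole sequence.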

As a neural network, the LSTM model can be easily
deepened by stacking the hidden states from a layer $l - 1$ as
inputs of the next layer $l$. In order to obtain the prediction
scores for a total of $C$ classes at a time step $t$, a softmax
layer is placed on top of the last LSTM layer $L$ to estimate
the posterior probability $p_c$ of the $c$-th class as:

$$p_c = \mathrm{softmax}(h_t) = \frac{\exp(u_c^\top h_t + b_c)}{\sum_{c'=1}^{C} \exp(u_{c'}^\top h_t + b_{c'})}, \tag{3}$$

where $u_c$ and $b_c$ represent the corresponding weight vector
and the bias term of the $c$-th class. Such an LSTM network
can be trained using the Back-Propagation Through Time 
(BPTT) algorithm [15], which “unrolls” the model into a 
feed forward neural net and back-propagates to determine 
the optimal network parameters. We adopt the output from 
the last layer as the video-level prediction scores since this 
output is computed based on the information from the entire 
sequence. Our empirical results show that using the last 
layer output is better than pooling the predictions at all the 
time steps. 
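
The softmax layer of Equation 3 and the two ways of producing a video-level score can be sketched as follows. The random hidden states stand in for the top-layer LSTM outputs, and reading "the last output" as the final time step of the sequence is an assumption of this example.

```python
import numpy as np

def class_posteriors(h, U, b):
    """Softmax over C classes; U is (C, d) with rows u_c, b is (C,)."""
    logits = U @ h + b
    logits = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
C, d, T = 5, 4, 6
U, b = rng.standard_normal((C, d)), np.zeros(C)
hidden_states = rng.standard_normal((T, d))  # stand-in for top-layer LSTM states

# Option 1: use only the final output, which has seen the entire sequence
p_last = class_posteriors(hidden_states[-1], U, b)
# Option 2: average the per-step predictions over all time steps
p_pooled = np.mean([class_posteriors(h, U, b) for h in hidden_states], axis=0)
```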


3.3. Adaptive Multi-Stream Fusion 


Given the prediction scores of multiple network streams 
(i.e., the ConvNets and the LSTM), we are able to capture
the video characteristics from different aspects. It is critical
to effectively fuse the multi-stream scores to generate
the final predictions. Different semantic classes are associated
with the individual streams to different degrees. For example,
some classes are strongly associated with particular
objects which could be effectively recognized with the spa¬ 
tial stream, while others may contain dramatic movements 
so the short-term motion and the long-term temporal clues 
can contribute more significantly. Traditional fusion meth¬ 
ods are usually performed at the stream-level without con¬ 
sidering the class-specific preferences. In addition, most 
existing studies on model fusion neglected the class rela¬ 
tionships that can serve as complementary information for 
improved performance [34, 4, 8]. In the following we in¬ 
troduce the proposed adaptive multi-stream fusion method, 
which is able to determine the optimal fusion weights adap¬ 
tively for each class. The highly correlated classes are also 
automatically identified and their relationships are utilized 
in the method. 


Formally, we denote the prediction scores from the $m$-th
stream as $s^m \in \mathbb{R}^C$ ($m = 1, \cdots, M$) with $C$ being the
number of classes, and let $y$ be the final predicted labels.
A straightforward way of late fusion is to compute the final
prediction as $y = f(s^1, \cdots, s^M)$. Here $f$ is a transition
function, which can be a linear function, a logistic func¬ 
tion, etc. However, such a late fusion approach treats all the 
classes uniformly without considering their different char¬ 
acteristics. 


Different from the uniform fusion methods, we at¬ 
tempt to adaptively integrate the predictions from multiple 
streams for each class by not only combining scores across 
streams but also utilizing class knowledge as a prior to provide
additional information. To this end, we first stack the
multiple score vectors of a training sample $n$ as a coefficient
vector: $s_n = [s_n^1; \cdots; s_n^m; \cdots; s_n^M] \in \mathbb{R}^{CM}$.
Then the best class-specific fusion weights can be learned
with logistic regression as:

$$W = \arg\min_{w_1, \cdots, w_C} \sum_{n,c} \log\left(1 + \exp\left[(1 - 2y_{n,c})\, s_n^\top w_c\right]\right), \tag{4}$$

where $y_{n,c}$ is the ground-truth label of the $n$-th sample for
class $c$, and $W = [w_1, \cdots, w_C] \in \mathbb{R}^{CM \times C}$.
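
To make the objective in Equation 4 concrete, the sketch below evaluates the logistic loss and its gradient for stacked score vectors. The array layout (one row of `S` per training sample, labels in {0, 1}) and the random toy data are assumptions for illustration.

```python
import numpy as np

def fusion_logistic_loss(W, S, Y):
    """Equation 4 objective. S: (N, C*M) stacked scores, Y: (N, C) in {0, 1}, W: (C*M, C)."""
    margins = (1.0 - 2.0 * Y) * (S @ W)      # factor is -1 for positives, +1 for negatives
    return np.logaddexp(0.0, margins).sum()  # log(1 + exp(.)) computed stably

def fusion_logistic_grad(W, S, Y):
    """Gradient of the loss with respect to W, same shape as W."""
    signs = 1.0 - 2.0 * Y
    sig = 1.0 / (1.0 + np.exp(-signs * (S @ W)))  # d/dz log(1 + exp(z)) = sigmoid(z)
    return S.T @ (signs * sig)

rng = np.random.default_rng(0)
N, C, M = 6, 3, 2
S = rng.random((N, C * M))        # stand-in stacked prediction scores
Y = np.eye(C)[np.arange(N) % C]   # one-hot ground-truth labels
W0 = np.zeros((C * M, C))
loss_at_zero = fusion_logistic_loss(W0, S, Y)  # equals N * C * log(2)
```

With all weights at zero every term contributes log 2, which gives a quick sanity check on an implementation before any optimization is run.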

However, direct optimization with the above formulation 
often leads to over-fitting and produces limited performance 
on the test set. In order to alleviate this and take the class re¬ 
lationships into account, we use the relationships as a prior 
to guide the learning of the weights. More precisely, we 
first compute a correlation matrix $V^m \in \mathbb{R}^{C \times C}$ of the
classes for the $m$-th stream using the corresponding prediction
scores, where each entry $V^m_{ij}$ indicates the percentage
of the samples with the ground-truth label of class $i$ being
wrongly classified into class $j$. The reason for using a separate
correlation matrix for each stream is that the captured
class relationships in different streams are likely to be quite
different. Next, we stack the similarity matrices of all the
streams $V = [V^1, \cdots, V^m, \cdots, V^M]^\top$ to regularize the
weight learning process as:

$$\min_W \; \mathcal{L}(S, Y; W) + \lambda_1 \|W - V\|_F^2, \tag{5}$$

where the first term is the empirical loss that measures the 
discrepancy between the ground-truth labels Y and the pre¬ 
diction scores S, and the second term regularizes the fusion 
weights using the class correlation as a prior. For each similarity
matrix $V^m$, the non-diagonal entries demonstrate the
similarities among different classes, which can be used to 
guide the weight learning process through borrowing infor¬ 
mation from highly related classes. 
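
One plausible way to estimate the per-stream class relationship matrices is a row-normalized confusion matrix, sketched below. Single-label argmax predictions, keeping the diagonal (the fraction classified correctly), and the random stand-in scores are assumptions of this example; the text only specifies the meaning of the off-diagonal entries.

```python
import numpy as np

def class_correlation(scores, labels, num_classes):
    """Row-normalized confusion matrix of one stream.

    scores: (N, C) prediction scores, labels: (N,) integer ground truth.
    Entry (i, j) is the fraction of class-i samples predicted as class j.
    """
    pred = scores.argmax(axis=1)
    V = np.zeros((num_classes, num_classes))
    np.add.at(V, (labels, pred), 1.0)          # accumulate counts per (true, predicted)
    row_sums = V.sum(axis=1, keepdims=True)
    return V / np.maximum(row_sums, 1.0)       # guard against classes with no samples

rng = np.random.default_rng(0)
N, C, M = 60, 4, 3
labels = np.arange(N) % C                                  # every class is present
stream_scores = [rng.random((N, C)) for _ in range(M)]     # stand-in scores

# Transpose and stack the per-stream matrices so that V matches W: (C*M, C)
V = np.vstack([class_correlation(s, labels, C).T for s in stream_scores])
```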

In addition, we also incorporate an $\ell_1$ norm regularization
to impose sparsity on the weight matrix, which, to some
extent, can help avoid information sharing from irrelevant
classes. With both regularization terms, we have the following
optimization problem:

$$\min_W \; \mathcal{L}(S, Y; W) + \lambda_1 \|W - V\|_F^2 + \lambda_2 \|W\|_1. \tag{6}$$

In summary, by treating the class correlation matrix as a 
prior, our fusion approach minimizes an empirical loss reg¬ 
ularized by a sparsity constraint to effectively derive class 
adaptive fusion weights. 

Although the loss function in Equation 6 is convex, it is 
non-trivial to solve it due to the non-smooth term. To tackle 
the optimization problem efficiently, we adopt the proximal 
gradient descent method that splits the objective function 
into a smooth part and a non-smooth part: 

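$$g = \mathcal{L}(S, Y; W) + \lambda_1 \|W - V\|_F^2, \tag{7}$$

$$h = \lambda_2 \|W\|_1. \tag{8}$$

The update of $W$ at the $k$-th iteration can be simply computed
as:

$$W^{k+1} = \mathrm{Prox}_h\left(W^k - \nabla g(W^k)\right),$$

where $\mathrm{Prox}_h$ denotes the soft-thresholding operator for the
$\ell_1$ norm [10].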

Note that the additional computational cost lies in the 
estimation of the proximal operator. Since it can be ana¬ 
lytically solved in linear time [3], the above optimization 
process is fairly efficient. 
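
The proximal gradient procedure can be sketched as below for the objective in Equation 6, assuming a logistic empirical loss and a squared Frobenius penalty for the class relationship term. The explicit step size (set from a Lipschitz bound on the smooth part), the regularization values, and the random toy data are assumptions introduced for this example and are not taken from the paper.

```python
import numpy as np

def soft_threshold(W, tau):
    """Proximal operator of tau * ||W||_1 (element-wise shrinkage)."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def smooth_part(W, S, Y, V, lam1):
    """Value and gradient of g(W): logistic loss plus lam1 * ||W - V||_F^2."""
    signs = 1.0 - 2.0 * Y
    Z = signs * (S @ W)
    value = np.logaddexp(0.0, Z).sum() + lam1 * np.sum((W - V) ** 2)
    grad = S.T @ (signs / (1.0 + np.exp(-Z))) + 2.0 * lam1 * (W - V)
    return value, grad

def fit_fusion_weights(S, Y, V, lam1, lam2, step, n_iter=200):
    W = V.copy()  # start from the class relationship prior
    history = []
    for _ in range(n_iter):
        value, grad = smooth_part(W, S, Y, V, lam1)
        history.append(value + lam2 * np.abs(W).sum())  # full objective at current W
        # Gradient step on the smooth part, then shrinkage for the l1 term
        W = soft_threshold(W - step * grad, step * lam2)
    return W, history

rng = np.random.default_rng(0)
N, C, M = 40, 3, 2
S = rng.random((N, C * M))         # stand-in stacked scores
Y = np.eye(C)[np.arange(N) % C]    # one-hot labels
V = 0.1 * rng.random((C * M, C))   # stand-in class relationship prior
lam1, lam2 = 1e-2, 1e-3            # illustrative values only
lipschitz = 0.25 * np.linalg.norm(S, 2) ** 2 + 2.0 * lam1
W, history = fit_fusion_weights(S, Y, V, lam1, lam2, step=1.0 / lipschitz)
```

With a step size no larger than the inverse Lipschitz constant of the smooth gradient, the recorded objective values do not increase from one iteration to the next, which is a convenient check when experimenting with the regularization weights.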

3.4. Implementation Details and Discussions 

ConvNet Models. In this work, we adopt two ConvNet
architectures, the CNN_M [36] model for capturing
the short-term motion and the audio clues and a recent
deeper VGG_19 [37] architecture for the spatial stream.
The CNN_M is basically a variant of the AlexNet [23] with
more filters included, which contains five convolutional lay¬ 
ers followed by three fully connected layers. The VGG_19 
not only reduces the size of the convolutional filters and 
the stride, but also extends the depth of the network to a 
total of 19 layers, equipping the architecture with the ca¬ 
pacity of learning more robust representations. These two 
deep networks achieved 13.5% [36] and 7.5% [37] top-5 er¬ 
ror rates on the ImageNet ILSVRC-2012 validation set, re¬ 
spectively. All the ConvNet models are trained using mini¬ 
batch stochastic gradient descent with a momentum fixed 
to 0.9. Our implementation is based on the publicly avail¬ 
able Caffe toolbox [19] with some modifications. The input 
video frame is uniformly fixed to the size of 224x224. In 
addition, we also perform simple data augmentations like 
cropping and flipping following [36]. 

The spatial and the audio ConvNets are first pre-trained 
using the ILSVRC-2012 training set with 1.2 million im¬ 
ages and then fine-tuned using the training video data. This 
strategy has been observed to be effective in [36] for the spatial
stream, and we have found it helpful for the audio
stream. To fine-tune the spatial and the audio ConvNets, we 
gradually decrease the learning rate from 10“^ to 10“^ after 
14K iterations, then to 10“® after 20K iterations. In addi¬ 
tion, dropout is applied to the fully connected layers with a 
ratio of 0.5 to avoid over-fitting. 

To train the motion ConvNet, we first compute optical 
flow using the GPU implementation of [6] and stack the optical
flows in each 10-frame window to obtain a 20-channel
optical flow image as the input (one horizontal channel and 
one vertical channel for each frame pair). Unlike the spa¬ 
tial and the audio ConvNets, we train the motion ConvNet 
from scratch by adopting 0.7 dropout ratio and setting the 
learning rate to 10“^ initially, which is reduced to 10“^
after 100K iterations and then to 10“^ after 200K iterations.
Note that we also tried to use the VGG_19 network to train 


the motion ConvNet, but observed worse results as the network
contains many more parameters that cannot be well
tuned using the limited training video data.
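
The construction of the 20-channel motion input can be illustrated with the NumPy sketch below. The interleaved channel order (horizontal, then vertical, for each frame pair) and the random stand-in flow fields are assumptions; the text only states that each 10-frame window yields one horizontal and one vertical channel per frame pair.

```python
import numpy as np

def stack_flows(flow_x, flow_y, start, window=10):
    """Build one motion-stream input from `window` consecutive flow fields.

    flow_x, flow_y: (T, H, W) horizontal and vertical displacement maps.
    Returns an array of shape (2 * window, H, W), channels interleaved x, y.
    """
    channels = []
    for t in range(start, start + window):
        channels.append(flow_x[t])
        channels.append(flow_y[t])
    return np.stack(channels)

rng = np.random.default_rng(0)
T, H, W = 30, 224, 224
flow_x = rng.standard_normal((T, H, W)).astype(np.float32)  # stand-in flow fields
flow_y = rng.standard_normal((T, H, W)).astype(np.float32)
motion_input = stack_flows(flow_x, flow_y, start=5)          # shape (20, 224, 224)
```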

LSTM. We adopt the two-layer LSTM model proposed 
by Graves [15] for temporal modeling. Two models are 
trained with features extracted respectively from the first 
fully-connected layer of the spatial and the motion Con¬ 
vNets as inputs. Each LSTM has 1,024 hidden units in the 
first layer and 512 hidden units in the second layer. We uti¬ 
lize a parallel implementation of the BPTT algorithm with 
a mini-batch size of 10 to train the network weights, where 
the learning rate and momentum are set as 10“^ and 0.9. In 
addition, we set the maximal training iterations to be 150K. 
Note that, in this paper, we focus on a multi-stream frame¬ 
work by utilizing the audio signal as a single stream for 
video classification. Further decomposing the audio track 
into multiple segments to extract more detailed temporal au¬ 
dio dynamics is feasible. 

Fusion. As shown in Equation 6, the proposed adaptive 
fusion strategy seeks a tradeoff between the empirical loss 
and the two regularization terms. We uniformly fix $\lambda_2$ to
be 10“^ to encourage sparsity in the weight matrix. The
parameter $\lambda_1$ is selected among {10“®, 10“"^, 10“^, 10“^}
using cross-validation. 

Discussions. The proposed multi-stream framework has 
the capability of modeling video data comprehensively by 
adaptively fusing audio, static spatial, short-term motion 
and long-term temporal clues. As described above, such 
a framework consists of multiple separately trained deep 
networks. Although it is feasible to jointly train the entire
framework, doing so is complicated and computationally demanding.
A recent work performing joint training of the
LSTM with a ConvNet improves the results on the UCF- 
101 benchmark from 70.5% (separate network training) to 
71.1% [9], which is not very significant. In addition, train¬ 
ing multiple deep networks separately makes the approach 
more flexible, where a component may be replaced with¬ 
out the need of re-training the entire framework. For in¬ 
stance, one can utilize more discriminative ConvNet mod¬ 
els like the GoogLeNet [41] and deeper RNN models [7] to 
replace the current ConvNet and LSTM parts respectively 
for better performance. Therefore, in this work, we focus 
on presenting a general framework for multi-stream video 
classification. With the proposed adaptive fusion method, 
such a multi-stream framework is empirically shown to be
effective for the video classification task, as discussed in the 
following section. 

4. Experiments 

In this section, we report results on two popular datasets. 
Experiments are designed to study the effectiveness of each 
individual stream and the proposed adaptive multi-stream 
fusion method. 


4.1. Experimental Setup 

Datasets and Evaluation Measures. UCF-101 [38] is a 
widely adopted dataset for human action recognition, con¬ 
taining 13,320 video clips annotated into 101 action classes. 
All the video clips have a fixed frame rate of 25 fps with a 
spatial resolution of 320 x 240 pixels. This dataset is chal¬ 
lenging because most videos were captured under uncon¬ 
trolled environments with camera motion, cluttered back¬ 
grounds and large intra-class variations. We follow the sug¬ 
gested experimental protocol and report mean accuracy over 
the three training and test splits. 

The Columbia Consumer Videos (CCV) dataset [20] 
contains 9,317 YouTube videos and 20 classes. Most of the 
classes are events like “basketball”, “graduation ceremony” 
and “wedding dance”. A few are scenes and objects like 
“beach” and “dog”. Following [20], we adopt the suggested 
training and test split and compute the average precision 
(AP) for each class. Mean AP (mAP) is used to measure 
the overall performance on this dataset. 
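
For reference, a common non-interpolated definition of average precision and mAP is sketched below. Whether the CCV evaluation protocol uses exactly this variant is an assumption; the toy scores and labels are hypothetical.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of the precision values at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")   # rank videos by decreasing score
    hits = labels[order].astype(float)
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(score_matrix, label_matrix):
    """score_matrix, label_matrix: (N, C); returns the mean of the per-class APs."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))

# Toy example: positives are ranked 1st and 3rd, so AP = (1/1 + 2/3) / 2
toy_ap = average_precision(np.array([0.9, 0.8, 0.7, 0.6]), np.array([1, 0, 1, 0]))
```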

The two datasets possess very different characteristics. 
Besides the difference of the defined semantic classes, the 
average video duration of CCV is 80 seconds, which is 
around ten times longer than that of UCF-101. Testing on 
these two datasets is helpful for evaluating the effectiveness 
and the generalization capability of our multi-stream classi¬ 
fication approach. 

Alternative Fusion Methods. To validate the effective¬ 
ness of our adaptive multi-stream fusion method, we com¬ 
pare with the following alternatives: (1) Average Fusion, 
where the mean scores of multiple networks are used as the 
final prediction; (2) Weighted Fusion, where the scores are 
fused linearly with weights estimated by cross-validation; 
(3) Kernel Average Fusion, where the scores are used as fea¬ 
tures and kernels computed from different network scores 
are averaged to train an SVM classifier; (4) Multiple Kernel 
Learning (MKL) Fusion, where the kernels are combined 
using the $\ell_p$-norm MKL algorithm [22]; (5) Logistic Regression
Fusion, where a logistic regression model is trained
to estimate the fusion weights. 
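
The two simplest baselines, average fusion and weighted fusion, reduce to a few array operations. The sketch below assumes the stream scores are arranged as an array of shape (streams, videos, classes); the weights shown are placeholders rather than cross-validated values.

```python
import numpy as np

def average_fusion(stream_scores):
    """stream_scores: (M, N, C). Returns the (N, C) mean over the M streams."""
    return np.mean(stream_scores, axis=0)

def weighted_fusion(stream_scores, weights):
    """Linear fusion with one weight per stream (weights chosen, e.g., by cross-validation)."""
    weights = np.asarray(weights, dtype=float)
    # Contract the stream axis: sum_m weights[m] * stream_scores[m]
    return np.tensordot(weights, stream_scores, axes=1)

rng = np.random.default_rng(0)
scores = rng.random((3, 5, 4))                 # M=3 streams, N=5 videos, C=4 classes
avg = average_fusion(scores)
wtd = weighted_fusion(scores, [0.5, 0.3, 0.2])  # placeholder weights
predicted = wtd.argmax(axis=1)                  # predicted class per video
```

Both baselines apply the same weight to every class within a stream, which is the uniformity that the class-adaptive fusion method is designed to relax.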

4.2. Results and Discussions 

4.2.1 Multi-Stream Networks 

We first report the performance of each individual stream
on both datasets. After that, average fusion is adopted to 
study whether two or more streams are complementary. The 
proposed adaptive fusion method will be evaluated later. 

Table 1 reports the results. Comparing the top two groups
of results on UCF-101, it is interesting to observe that the
spatial LSTM outperforms the spatial ConvNet and the motion
LSTM is also comparable to the motion ConvNet. This
is largely because the ConvNet-based classification fully
discards the long-term temporal clues, which contain valuable
information that can be exploited by the LSTM.

                                   UCF-101   CCV
Spatial ConvNet                    80.4      75.0
Motion ConvNet                     78.3      59.1
Spatial LSTM                       83.3      43.3
Motion LSTM                        76.6      54.7
Audio ConvNet                      16.2*     21.5
ConvNet (spatial+motion)           86.2      75.8
LSTM (spatial+motion)              86.3      61.9
ConvNet+LSTM (spatial)             84.0      77.9
ConvNet+LSTM (motion)              81.4      70.9
ConvNet+LSTM (spatial+motion)      90.1      81.7
All the streams                    90.3      82.4

Table 1. Performance of each individual stream and their average
fusion (indicated by ‘+’). *Note that only the videos of 51 classes
in UCF-101 contain audio soundtracks. The audio ConvNet can
produce an accuracy of 32.1% on the 51-class subset.

On the CCV dataset, the ConvNet achieves significantly 
better results than the LSTM on both spatial and motion 
streams. This is because the classes in CCV are either high- 
level events or objects/scenes. Compared with human ac¬ 
tions, the temporal clues of these classes are more obscure 
and thus difficult to be captured. Also, the CCV videos are 
temporally untrimmed, which may contain significant por¬ 
tions of contents irrelevant to the classes, making the tem¬ 
poral modeling task even more difficult. 

The audio ConvNets operated on spectrograms produce 
16.2% on UCF-101 and 21.5% on CCV. Note that only 51 
classes in UCF-101 have audio signals, and the performance 
on the 51-class subset is actually 32.1%. The audio stream 
is much worse than the spatial and the motion streams on 
both datasets, confirming that the visual channel is more
informative than the audio counterpart. 

Next, we evaluate the combinations of multiple networks 
to study whether fusion can compensate for the limitations of a
single stream in describing complex video data. The simple
average fusion is adopted. Results are summarized in
the bottom three groups of Table 1. We first assess the
gain from integrating the spatial and the motion information
modeled by ConvNet and LSTM respectively. On UCF-101,
significant improvements (about 6% for ConvNet and
3% for LSTM) are observed over the best single stream
results. The gain on CCV is consistent but not as significant
as that on UCF-101, indicating that the short-term
motion is more critical for human action analysis. Note 
that the average fusion of the spatial and the motion Con¬ 
vNets follows the same idea of the two-stream approach 
proposed in [36]. Our implementation of this approach pro¬ 
duces slightly worse performance than that originally re¬ 
ported in [36] (86.2% vs. 88.0%). 

















We also fuse ConvNet with LSTM separately on both 
streams to investigate the contribution of the long-term temporal
modeling. Overall, we observe very consistent improvements
on both datasets. In particular, on CCV, although
the individual LSTM model is worse than ConvNet,
the combination of them leads to significant improvements.
Notably, a gain of nearly 12% is obtained on the motion
stream. These results show that the long-term temporal
clues are highly complementary to the ConvNet-based pre¬ 
dictions, even in the case of modeling complex contents in 
the long CCV videos, which is fairly appealing. 

Finally, the combination of ConvNet and LSTM on both 
streams, indicated by “ConvNet+LSTM (spatial+motion)”,
achieves 90.1% and 81.7% on UCF-101 and CCV respectively.
Further adding the audio ConvNet (“all the streams”)
can improve the results, particularly on CCV, which contains
many classes that can be partly revealed by auditory clues
(e.g., cheering sounds in the sports events). In summary, the
fusion results clearly demonstrate that all the multimodal 
clues in our approach are useful and should be adopted in a 
successful video classification system. 

4.2.2 Adaptive Multi-Stream Fusion 

In this subsection, we evaluate the proposed adaptive multi-stream
fusion approach, and compare it with the alternative
methods. Table 2 gives the results. We see that all the methods
produce better results than the individual streams. The
simple average fusion and weighted fusion are slightly better
than the learning based kernel fusion and logistic regression
fusion, indicating that learning to fuse the prediction
scores in a “free” manner is prone to over-fitting. Kernel average
fusion shows slightly better results than MKL, which
is consistent with the observations in several previous studies
like [12].

Our proposed adaptive multi-stream fusion (the bottom
row) outperforms all the alternatives with clear margins. To
investigate the contributions of the two regularizers in our
approach, we set λ1 and λ2 to zero in turn. As can
be seen, the class relationship regularizer (λ2 = 0) plays a
more important role than the sparsity regularizer (λ1 = 0).
This corroborates the effectiveness of using the class relationships,
which not only brings in useful contextual information
but also helps prevent over-fitting. The two regularizers
are complementary as the sparsity inducing norm
further enhances robustness by alleviating incorrect information
sharing. Note that when eliminating both regularizers,
our fusion approach degenerates to the standard logistic
regression fusion.
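
The degeneration to logistic regression can be illustrated with a sketch of a regularized fusion objective. The exact objective is defined earlier in the paper; the form below (a multinomial logistic loss plus a trace-based class-relationship penalty and an L1 sparsity penalty) is an assumption made for illustration, and the names `fusion_objective`, `omega`, `lam_rel` and `lam_sparse` are hypothetical.

```python
import numpy as np

def softmax_log_loss(W, X, y):
    """Mean multinomial logistic loss; W is (D, C), X is (N, D), y holds class ids."""
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def fusion_objective(W, X, y, omega, lam_rel, lam_sparse):
    """Logistic loss + class-relationship penalty + L1 penalty (assumed form)."""
    # omega is an assumed (C, C) class-relationship matrix, positive definite.
    rel_penalty = 0.5 * np.trace(W @ np.linalg.inv(omega) @ W.T)
    sparse_penalty = np.abs(W).sum()
    return softmax_log_loss(W, X, y) + lam_rel * rel_penalty + lam_sparse * sparse_penalty

# Toy problem: X stacks the stream scores of 8 videos (2 streams x 3 classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
y = rng.integers(0, 3, size=8)
W = rng.normal(size=(6, 3))
omega = np.eye(3) / 3.0          # uninformative class relationships, trace one

plain = softmax_log_loss(W, X, y)
no_reg = fusion_objective(W, X, y, omega, lam_rel=0.0, lam_sparse=0.0)
with_reg = fusion_objective(W, X, y, omega, lam_rel=0.1, lam_sparse=0.01)
```

With both weights at zero the objective equals the plain logistic loss, which mirrors the degenerate case described in the text; any positive weight adds a non-negative penalty on top of it.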

The contribution of the audio clues is similar on both
datasets (“-A” indicates the same approach without using
the audio ConvNet). Audio improves just 0.4% on UCF-101
because only half of the video clips contain soundtracks.
Overall, the gain from the adaptive multi-stream fusion
is more significant on UCF-101 as it has more classes
for semantic sharing. Figure 3 further shows the per-class
performance on CCV, where we can see that the fusion leads
to very consistent and significant improvements for all the
classes.

                                         UCF-101    CCV
Average fusion                             90.3     82.4
Weighted fusion                            90.6     82.7
Kernel average fusion                      90.2     82.1
MKL fusion                                 89.6     81.8
Logistic regression fusion                 89.8     82.0
Adaptive multi-stream fusion (λ1=0)        90.9     82.8
Adaptive multi-stream fusion (λ2=0)        91.6     83.7
Adaptive multi-stream fusion (-A)          92.2     84.0
Adaptive multi-stream fusion               92.6     84.9

Table 2. Comparison of fusion methods. “-A” indicates that the
audio stream ConvNet is not adopted. See text for discussion.
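
The differences quoted in this subsection can be recomputed from Table 2. The snippet below is a small arithmetic check that uses only the accuracies listed in the table; the dictionary keys are shorthand labels introduced here, not names from the paper.

```python
# Accuracies (%) copied from Table 2 as (UCF-101, CCV).
table2 = {
    "adaptive_first_weight_zero": (90.9, 82.8),   # row with the first regularizer weight set to 0
    "adaptive_second_weight_zero": (91.6, 83.7),  # row with the second regularizer weight set to 0
    "adaptive_no_audio": (92.2, 84.0),            # row marked "-A"
    "adaptive_full": (92.6, 84.9),
}
datasets = ("UCF-101", "CCV")

def drop_from_full(row):
    """Accuracy lost, per dataset, relative to the full adaptive fusion."""
    full = table2["adaptive_full"]
    return {d: round(full[i] - table2[row][i], 1) for i, d in enumerate(datasets)}

audio_gain = drop_from_full("adaptive_no_audio")
first_zero_drop = drop_from_full("adaptive_first_weight_zero")
second_zero_drop = drop_from_full("adaptive_second_weight_zero")

print("gain from audio:", audio_gain)
print("drop with first weight zero:", first_zero_drop)
print("drop with second weight zero:", second_zero_drop)
```

On both datasets the drop is larger when the first weight is zeroed than when the second is, which is the comparison the ablation discussion relies on.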

4.2.3 Comparison with State of the Arts 

We compare our approach with the state of the arts on both
datasets. Results are listed in Table 3. Our proposed multi-stream
approach achieves the highest performance on both
datasets. On UCF-101, many works with competitive results
are based on the hand-engineered dense trajectory features
[45, 25], while our approach fully relies on the deep
networks. Compared with the original result of the two-stream
approach [36], our approach captures a more comprehensive
set of useful clues with a more effective fusion
strategy. Note that a gain of even just 1% on the widely
adopted UCF-101 dataset is generally considered
significant progress.

In addition, the recent works in [9, 39, 46, 32] also
adopted the LSTM to model the temporal clues for video
classification and reported promising performance, but did
not explore the audio stream or employ advanced fusion
strategies. Zha et al. [50] combined the ConvNet features
with the dense trajectories [44] to achieve very competitive
results.

On the CCV dataset, all the recent approaches were
developed based on multiple features, either the hand-engineered
descriptors or the ConvNet-based representations.
Our approach produces better results than all of them.

Figure 3. Per-class performance on CCV. Adaptive fusion of the multi-stream deep networks produces consistently better results than the
individual streams on all the classes.

UCF-101                            CCV
Donahue et al. [9]        82.9     Lai et al. [24]      43.6
Srivastava et al. [39]    84.3     Jiang et al. [20]    59.5
Wang et al. [45]          85.9     Xu et al. [47]       60.3
Tran et al. [42]          86.7     Ma et al. [29]       63.4
Simonyan et al. [36]      88.0     Jhuo et al. [17]     64.0
Ng et al. [32]            88.6     Ye et al. [49]       64.0
Lan et al. [25]           89.1     Liu et al. [27]      68.2
Zha et al. [50]           89.6     Wu et al. [46]       83.5
Wu et al. [46]            91.3
Ours (-A)                 92.2     Ours (-A)            84.0
Ours                      92.6     Ours                 84.9

Table 3. Comparison with state-of-the-art results. Our approach
produces to-date the highest reported results on both datasets.
“Ours (-A)” indicates the same framework without using the audio
stream ConvNet.

5. Conclusions

We have presented a multi-stream framework of deep
networks for video classification. The framework harnesses
multimodal features that are more comprehensive
than those previously adopted. Specifically, standard ConvNets
are applied to audio spectrograms, visual frames and
stacked optical flows to exploit the audio, spatial and short-term
motion clues in videos, respectively. LSTM is further
adopted on the spatial and the short-term motion features
from the ConvNets for long-term temporal modeling. The
outputs from the different streams are then fused using a
novel method that adaptively learns the fusion weights for
each class. By imposing regularization with the prior
information and the sparsity, the weight learning process explores
semantic class correlations while suppressing inappropriate
knowledge sharing among irrelevant classes. Our
results confirm that all the adopted streams are effective for
modeling not only simple human actions in short clips but
also complex events in temporally untrimmed videos on the
Internet. Combining all the streams by our proposed adaptive
fusion method outperforms peer approaches by significant
margins on two popular benchmarks.

The work in this paper is among the very few studies
showing strong video classification performance using deep
networks. As mentioned earlier, unlike the spatial ConvNet,
which can be trained by fine-tuning a model pre-trained on
the ImageNet dataset, the motion ConvNet has to be trained
from scratch on videos. Therefore, one promising future direction
is to pre-train the motion ConvNet using large video
datasets like Sports-1M [21], which may improve the
results significantly.